
    Towards a sparse linear system solver suited to NUMA machines

    Linear system solvers have made enormous progress over recent years and are now beginning to exploit multi-core architectures and their shared memory. The PaStiX solver is being developed in a hybrid MPI/thread version to reduce the memory consumed by communication buffers. We now want to adapt it to NUMA architectures and to provide dynamic scheduling for these architectures, since cost models cannot fully capture their characteristics. We first study the importance of a new memory management scheme for NUMA architectures, then describe how we implemented this dynamic scheduling. Results illustrate our work, and we conclude with the study of a challenge case corresponding to a problem with 10 million unknowns.

    Taking advantage of hybrid systems for sparse direct solvers via task-based runtimes

    The ongoing hardware evolution exhibits an escalation in the number, as well as in the heterogeneity, of computing resources. The pressure to maintain reasonable levels of performance and portability forces application developers to leave the traditional programming paradigms and explore alternative solutions. PaStiX is a parallel sparse direct solver, based on a dynamic scheduler for modern hierarchical manycore architectures. In this paper, we study the benefits and limits of replacing the highly specialized internal scheduler of the PaStiX solver with two generic runtime systems: PaRSEC and StarPU. The task graph of the factorization step is made available to the two runtimes, providing them the opportunity to process and optimize its traversal in order to maximize the algorithm's efficiency on the targeted hardware platform. A comparative study of the performance of the PaStiX solver on top of its native internal scheduler, PaRSEC, and StarPU, on different execution environments, is performed. The analysis highlights that these generic task-based runtimes achieve results comparable to the application-optimized embedded scheduler on homogeneous platforms. Furthermore, they are able to significantly speed up the solver on heterogeneous environments by taking advantage of the accelerators while hiding the complexity of their efficient manipulation from the programmer. (Heterogeneity in Computing Workshop, 2014)
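    The following toy Python sketch illustrates the idea of handing a factorization's task graph to a generic runtime: tasks are declared with their dependencies, and an executor runs whatever becomes ready. It peels off ready tasks level by level, whereas PaRSEC and StarPU schedule fully asynchronously; all names here are illustrative, and none of this is the actual PaStiX, PaRSEC, or StarPU API.

```python
import concurrent.futures as cf
from collections import defaultdict

def run_dag(tasks, edges, workers=4):
    """Run a task graph: tasks is {name: callable}, edges is a list of
    (before, after) pairs. Ready tasks are peeled off in waves (Kahn's
    algorithm) and each wave runs in parallel on a thread pool."""
    indeg = {t: 0 for t in tasks}
    succ = defaultdict(list)
    for a, b in edges:
        indeg[b] += 1
        succ[a].append(b)
    ready = [t for t in tasks if indeg[t] == 0]
    with cf.ThreadPoolExecutor(workers) as pool:
        while ready:
            list(pool.map(lambda t: tasks[t](), ready))  # execute a wave
            nxt = []
            for t in ready:                 # release dependent tasks
                for s in succ[t]:
                    indeg[s] -= 1
                    if indeg[s] == 0:
                        nxt.append(s)
            ready = nxt

# A tiny slice of a blocked Cholesky DAG: factor a diagonal block, apply
# the triangular solve, update the trailing block, factor the next block.
names = ["POTRF0", "TRSM10", "SYRK1", "POTRF1"]
run_dag({t: (lambda t=t: print("run", t)) for t in names},
        [("POTRF0", "TRSM10"), ("TRSM10", "SYRK1"), ("SYRK1", "POTRF1")])
```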

    A NUMA Aware Scheduler for a Parallel Sparse Direct Solver

    Over the past few years, parallel sparse direct solvers have made significant progress. They are now able to solve efficiently real-life three-dimensional problems with several million equations. Nevertheless, the need for a large amount of memory is often a bottleneck in these methods. The authors have proposed a hybrid MPI-thread implementation of a direct solver that is well suited to SMP nodes or modern multi-core architectures. Modern multi-processing architectures are commonly based on shared-memory systems with NUMA behavior. These computers are composed of several chipsets, each including one or several cores associated with a memory bank. Such an architecture implies hierarchical memory access times from a given core to the different memory banks, which do not exist on SMP nodes. Thus, the main data structures of our targeted application have been modified to be more suitable for NUMA architectures. We also introduce a simple way of dynamically scheduling an application based on a dependency tree while taking NUMA effects into account. The results obtained with these modifications are illustrated by the performance of the PaStiX solver on different platforms and matrices.
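    As a rough illustration of the dynamic scheduling idea above, the sketch below tags each task with the NUMA node that owns its data and lets an idle core take local work before stealing remote work. This is a minimal policy sketch, not the PaStiX scheduler; `NODES`, `submit`, and `next_task` are invented for the example.

```python
from collections import deque

NODES = 2                               # hypothetical 2-node NUMA machine
queues = [deque() for _ in range(NODES)]

def submit(task, home_node):
    """Queue a task on the node where its data was allocated (first touch)."""
    queues[home_node].append(task)

def next_task(my_node):
    """Prefer local tasks (cheap memory access); steal remotely when idle."""
    if queues[my_node]:
        return queues[my_node].popleft()
    for victim in range(NODES):
        if victim != my_node and queues[victim]:
            return queues[victim].pop()  # steal from the far end
    return None

for i in range(6):
    submit(f"task{i}", home_node=i % NODES)
while (t := next_task(my_node=0)) is not None:
    print("core on node 0 runs", t)
```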

    Designing LU-QR hybrid solvers for performance and stability

    This paper introduces hybrid LU-QR algorithms for solving dense linear systems of the form Ax = b. Throughout a matrix factorization, these algorithms dynamically alternate LU steps with local pivoting and QR elimination steps, based upon some robustness criterion. LU elimination steps can be very efficiently parallelized and are twice as cheap as QR steps in terms of floating-point operations. However, LU steps are not necessarily stable, while QR steps are always stable. The hybrid algorithms execute a QR step when a robustness criterion detects some risk of instability, and they execute an LU step otherwise. Ideally, the choice between LU and QR steps must have a small computational overhead and must provide a satisfactory level of stability with as few QR steps as possible. In this paper, we introduce several robustness criteria and establish upper bounds on the growth factor of the norm of the updated matrix incurred by each of these criteria. In addition, we describe the implementation of the hybrid algorithms through an extension of the PaRSEC software to allow for dynamic choices during execution. Finally, we analyze both stability and performance results compared to state-of-the-art linear solvers on parallel distributed multicore platforms.
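    A minimal sketch of the alternation described above, in Python: at each elimination step, a pivot-magnitude test decides between a cheap Gaussian (LU) step with partial pivoting and an unconditionally stable Householder (QR) step. The threshold `alpha` is an illustrative criterion, not one of the paper's; the column-by-column formulation also ignores the blocked, distributed setting the paper targets.

```python
import numpy as np

def hybrid_solve(A, b, alpha=0.5):
    """Eliminate column by column, choosing an LU step when the pivot is
    large relative to its column (robustness test) and a Householder QR
    step otherwise; then back-substitute."""
    A, b = A.astype(float).copy(), b.astype(float).copy()
    n = len(b)
    for k in range(n - 1):
        col = A[k:, k]
        p = int(np.argmax(np.abs(col)))
        if np.abs(col[p]) >= alpha * np.linalg.norm(col):
            # LU step with partial pivoting: about half the flops of QR
            A[[k, k + p]] = A[[k + p, k]]
            b[[k, k + p]] = b[[k + p, k]]
            m = A[k + 1:, k] / A[k, k]
            A[k + 1:, k:] -= np.outer(m, A[k, k:])
            b[k + 1:] -= m * b[k]
        else:
            # QR step: one Householder reflector, unconditionally stable
            v = col.copy()
            v[0] += (1.0 if v[0] >= 0 else -1.0) * np.linalg.norm(v)
            v /= np.linalg.norm(v)
            A[k:, k:] -= 2.0 * np.outer(v, v @ A[k:, k:])
            b[k:] -= 2.0 * v * (v @ b[k:])
    return np.linalg.solve(np.triu(A), b)  # back-substitution

rng = np.random.default_rng(0)
A, b = rng.standard_normal((6, 6)), rng.standard_normal(6)
print(np.allclose(hybrid_solve(A, b), np.linalg.solve(A, b)))  # True
```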

    A NUMA Aware Scheduler for a Parallel Sparse Direct Solver

    Over the past few years, parallel sparse direct solvers have made significant progress and are now able to solve efficiently industrial three-dimensional problems with several million unknowns. To solve these problems efficiently, the PaStiX and WSMP solvers, for example, provide a hybrid MPI-thread implementation well suited for SMP nodes or multi-core architectures. It drastically reduces the memory overhead of the factorization and improves the scalability of the algorithms. However, today's modern architectures introduce hierarchical memory accesses that are not handled by these solvers. We present in this paper three improvements to the PaStiX solver for modern architectures: memory allocation, communication overlap, and dynamic scheduling. Results on numerical test cases demonstrate the efficiency of the approach on NUMA architectures.
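    The communication overlap improvement can be illustrated with non-blocking MPI calls: post the send of a block, continue with local computation, and wait on the request only when the data is needed. The mpi4py sketch below (run with two ranks, e.g. `mpirun -n 2 python overlap.py`) is a generic illustration of the technique, not PaStiX code.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

work = np.random.rand(400, 400)         # stand-in for local factorization work
if rank == 0:
    block = np.ones(1_000_000)          # a factored block to send downstream
    req = comm.Isend(block, dest=1, tag=7)  # post the send, return at once
    np.linalg.qr(work)                  # overlap: keep computing locally
    req.Wait()                          # block only once the buffer must be reused
elif rank == 1:
    buf = np.empty(1_000_000)
    req = comm.Irecv(buf, source=0, tag=7)
    np.linalg.qr(work)                  # overlap on the receiving side too
    req.Wait()
    print("rank 1 received", buf[0])
```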

    Reaching the quality of an SVD with low-rank compression for different variants of the QR factorization

    Solving linear equations of the type Ax = b for large sparse systems frequently arises in science and engineering applications and is often the main bottleneck. Although direct methods are costly in time and memory consumption, they are still the most robust way to solve these systems. Nowadays, the number of computational units in supercomputers keeps increasing, while the memory available per core shrinks. Thus, when solving these linear equations, reducing memory becomes as important as reducing time. For this purpose, compression methods are introduced within sparse solvers to reduce both memory and time consumption. In this respect, the Singular Value Decomposition (SVD) reaches the smallest possible rank, but it is too costly in practice. Rank-revealing QR decomposition variants are used as faster alternatives, at the price of possibly larger ranks. Among these variants, column pivoting or matrix rotation can be applied to the matrix A, such that the most important information in the matrix is gathered in the leftmost columns and the remaining unnecessary information can be omitted. To reduce the communication cost of the QR decomposition with column pivoting, blocked versions with randomization have been suggested as an alternative way to find the pivots. In these randomized variants, the matrix A is projected onto a lower-dimensional matrix using an i.i.d. Gaussian matrix, so that the pivoting/rotation matrix can be computed in the lower dimension. In addition, to avoid unnecessary updates of the trailing matrix at each iteration, a truncated randomized method is suggested, which is more efficient for larger matrix sizes. Thanks to these methods, results closer to the SVD are obtained with reduced compression cost.
    In this report, we compare all these methods in terms of complexity, numerical stability, obtained rank, performance, and accuracy.
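    A small numpy/scipy sketch of the randomized pivoted-QR compression described above, assuming a modest oversampling parameter: pivots are chosen on a Gaussian sketch of A rather than on A itself, and the resulting rank-k error is compared against the optimal SVD truncation. This illustrates the general technique, not the report's exact algorithm.

```python
import numpy as np
from scipy.linalg import qr

def randomized_qrcp(A, k, oversample=8, seed=0):
    """Rank-k compression: find pivot columns on a small Gaussian sketch
    of A (instead of A itself), then build a basis from those columns."""
    rng = np.random.default_rng(seed)
    sketch = rng.standard_normal((k + oversample, A.shape[0])) @ A
    _, _, piv = qr(sketch, mode='economic', pivoting=True)
    Q, _ = np.linalg.qr(A[:, piv[:k]])
    return Q, Q.T @ A                   # A is approximated by Q @ (Q.T @ A)

# Compare against the optimal rank-k (SVD) error on a decaying spectrum
rng = np.random.default_rng(1)
U, _ = np.linalg.qr(rng.standard_normal((200, 200)))
A = (U * 0.7 ** np.arange(200)) @ U.T   # singular values 0.7^0, 0.7^1, ...
Q, Z = randomized_qrcp(A, k=25)
s = np.linalg.svd(A, compute_uv=False)
print("randomized QRCP error:", np.linalg.norm(A - Q @ Z))
print("optimal SVD error:    ", np.linalg.norm(s[25:]))
```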

    Divide and Conquer Symmetric Tridiagonal Eigensolver for Multicore Architectures

    Computing the eigenpairs of a symmetric matrix is a problem arising in many industrial applications, including quantum physics and finite-element computations for automobiles. A classical approach is to reduce the matrix to tridiagonal form before computing the eigenpairs of the tridiagonal matrix. A back-transformation then yields the final solution. Parallelism issues of the reduction stage have already been tackled in different shared-memory libraries. In this article, we focus on solving the tridiagonal eigenproblem, and we describe a novel implementation of the Divide and Conquer algorithm. The algorithm is expressed as a sequential task flow, scheduled in an out-of-order fashion by a dynamic runtime that allows the programmer to tune task granularity. The resulting implementation is between two and five times faster than the equivalent routine from the Intel MKL library and outperforms the best MRRR implementation for many matrices.
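    To make the Divide and Conquer structure concrete, here is a compact numpy sketch: the tridiagonal matrix is torn into two halves plus a rank-one correction, each half is solved recursively, and the merge step diagonalizes diag(D) + beta*z*z^T. Production codes solve the merge through the secular equation with deflation; for brevity this sketch calls a dense eigensolver there, so it shows the recursion shape rather than the paper's task-flow implementation.

```python
import numpy as np

def tridiag(d, e):
    return np.diag(d) + np.diag(e, 1) + np.diag(e, -1)

def dc_eigh(d, e, cutoff=32):
    """Divide and Conquer for a symmetric tridiagonal matrix: tear into
    two halves plus a rank-one correction, recurse, then merge."""
    n = len(d)
    if n <= cutoff:
        return np.linalg.eigh(tridiag(d, e))
    m = n // 2
    beta = e[m - 1]
    d1, d2 = d[:m].copy(), d[m:].copy()
    d1[-1] -= beta                      # T = diag(T1, T2) + beta * v v^T
    d2[0] -= beta                       # with v = (0..0, 1, 1, 0..0)
    lam1, Q1 = dc_eigh(d1, e[:m - 1], cutoff)
    lam2, Q2 = dc_eigh(d2, e[m:], cutoff)
    D = np.concatenate([lam1, lam2])
    z = np.concatenate([Q1[-1, :], Q2[0, :]])  # v in the children's eigenbases
    # Merge: the secular equation in production codes; a dense
    # eigensolver stands in here for brevity.
    lam, G = np.linalg.eigh(np.diag(D) + beta * np.outer(z, z))
    Q = np.zeros((n, n))
    Q[:m, :m], Q[m:, m:] = Q1, Q2
    return lam, Q @ G

rng = np.random.default_rng(0)
d, e = rng.standard_normal(100), rng.standard_normal(99)
lam, Q = dc_eigh(d, e)
print(np.allclose(Q @ np.diag(lam) @ Q.T, tridiag(d, e)))  # True
```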

    Reordering Strategy for Blocking Optimization in Sparse Linear Solvers

    Solving sparse linear systems is a problem that arises in many scientific applications, and sparse direct solvers are a time-consuming and key kernel for those applications and for more advanced solvers such as hybrid direct-iterative solvers. For this reason, optimizing their performance on modern architectures is critical. The preprocessing steps of sparse direct solvers, ordering and block-symbolic factorization, are two major steps that lead to a reduced amount of computation and memory and to a better task granularity to reach a good level of performance when using BLAS kernels. With the advent of GPUs, the granularity of the block computation has become more important than ever. In this paper, we present a reordering strategy that increases this block granularity. This strategy relies on block-symbolic factorization to refine the ordering produced by tools such as Metis or Scotch, but it does not impact the number of operations required to solve the problem. We integrate this algorithm in the PaStiX solver and show an important reduction of the number of off-diagonal blocks on a large spectrum of matrices. This improvement leads to an increase in efficiency of up to 20% on GPUs.
    1. Introduction. Many scientific applications, such as electromagnetism, astrophysics, and computational fluid dynamics, use numerical models that require solving linear systems of the form Ax = b. In those problems, the matrix A can be considered as either dense (almost no zero entries) or sparse (mostly zero entries). Due to multiple structural and numerical differences that appear in those problems, many different solutions exist to solve them. In this paper, we focus on problems leading to sparse systems with a symmetric pattern and, more specifically, on direct methods which factorize the matrix A as LL^t, LDL^t, or LU, with L, D, and U respectively unit lower triangular, diagonal, and upper triangular, according to the numerical properties of the problem. Those sparse matrices appear mostly when discretizing partial differential equations (PDEs) on two-dimensional (2D) and three-dimensional (3D) finite element or finite volume meshes. The main issue with such factorizations is the fill-in (zero entries becoming nonzero) that appears in the factorized form of A during the execution of the algorithm. If not correctly considered, the fill-in can transform the sparse matrix into a dense one which might not fit in memory. In this context, sparse direct solvers rely on two important preprocessing steps to reduce this fill-in and control where it appears. The first one finds a suitable ordering of the unknowns that aims at minimizing the fill-in to limit the memory overhead and floating-point operations (flops) required to complete the factorization. The problem is then transformed into (PAP^t)(Px) = Pb.
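    The effect of the ordering step on fill-in is easy to observe with scipy's SuperLU interface, which exposes several built-in column orderings. The sketch below factors a 2D Laplacian with the natural ordering and with COLAMD and compares the fill ratios; it illustrates why ordering matters, not the reordering strategy of this paper (which refines a nested-dissection ordering from Metis or Scotch).

```python
import scipy.sparse as sp
from scipy.sparse.linalg import splu

# 5-point Laplacian on a k x k grid: a typical sparse PDE matrix
k = 40
I = sp.identity(k, format='csc')
T = sp.diags([-1, 4, -1], [-1, 0, 1], shape=(k, k), format='csc')
C = sp.diags([-1, -1], [-1, 1], shape=(k, k), format='csc')
A = (sp.kron(I, T) + sp.kron(C, I)).tocsc()

for spec in ('NATURAL', 'COLAMD'):      # SuperLU's built-in orderings
    lu = splu(A, permc_spec=spec)
    fill = (lu.L.nnz + lu.U.nnz) / A.nnz
    print(f'{spec:8s} fill ratio: {fill:5.1f}x')
```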